Formula 1 is a racing sport and the pinnacle of mechanical engineering where 10 teams (constructors) race each other for the world championship (mostly glory). We are considering data from 1950, the time that Formula 1 started, to 2017. As in any typical sport, the world championships of Formula 1 racing are decided by a number of parameters such as qualifying race times (these races usually happen before the main race to determine the starting positions of the cars in the main race) , lap times ( the time for a driver to go around a track once) and most importantly pit stop timings (when the tires of the cars are changed, the car is refueled, any broken parts are replaced) to name a few.
This is interesting because the person who is in front of the grid (a grid is the positioning of the cars from the start line of the race) may not necessarily win every time. Teams can follow a variety of different strategies, i.e., perform an undercut, overcut or push when required, in order to gain an advantage in the later stages of a specific race. This variation of data can be analyzed and visualized in numerous ways. Formula 1 being the bleeding edge of mechanical engineering provides a framework for the cars that are going to be designed in the future (usually the automobile technology that is being used in these cars, take a lot of time to be incorporated into normal cars) and therefore this analysis is significant.
Questions of Interest
â How many drivers who did not start the race in the first position went on to win?
â How many times has a constructor(teams) won at that specific venue over the years?
â How has the pitstop times changed over the years?
â Who is the driver who has won the most number of formula 1 championships in history?
Expected findings
â Our analysis indicates that over a period of time, there is a significant reduction in the pitstop time.
â The driver who has the maximum number of race wins, points and championships is Michael Schumacher.
â Data scraping and cleaning
Sumukh Sharma
10/18/2019
â Managing missing data
Sumukh Sharma
10/22/2019
We have created a document which has all the basic information about how we need to plug in data and manage missing information. It is as follows:
Formula 1 data mapping and managing:
Circuits: **
circuitId - Use as key
circuitRef
name
location
country â Useful to specify by country
lat - latitude
lng - longitude
alt
url
Constructors: **
constructorId â use as key
constructorRef
name
nationality
url
Drivers: **
driverId â Use as Key
driverRef
number â Fill blanks with -1
code
forename
surname
dob
nationality
url
Races: **
raceId â Use as Key
year â Season of the race
round â Race number in that year
circuitId â Same as Circuitid from circuit
name
date
time â Fill blanks with 10:00
url
Results: ***
resultId
raceId â Use key from Races
driverId â Use key from Drivers
constructorId â use key storm constructors
number â Driver number(5 for vettel, May not be unique). Fill Blanks with -1
grid â Starting grid position
position â Finishing position. Fill blanks with -1 for retired cards
positionText â Position Text. R stands for retired
positionOrder â Final position order ( This ideally should be the same as position 1âŚn where n is
number of cars that took part in that race)
points â Points won
laps â Number of laps driven in a race. Any driver that retired will have a lower number than those
that finished for a specific RaceID
time
milliseconds
fastestLap â Gives fastest lap number
rank â Rank of the fastest lap times
fastestLapTime â Lap time
fastestLapSpeed â Avg speed in the fastest lap
statusId â StatusId of the car at the end of the race. Use key from Status ( Finished. Did not finish.
Disqualified. Power unit issues. Etc.)
Seasons: *
Year - year
url
Status - **
statusId â Use this as a key
status â Status of the car at the end of the race and the reason for retiring.
Constructor results: *
constructorResultsId â Use as Key
raceId - Use key from Races
constructorId â Use key from Constructors
points â Points scored
status â Null/ D (D for disqualified)
Constructor Standings:*
constructorStandingsId
raceId â Use key from Races
constructorId â Use key from Constructors
points â Points scored â Fill Blanks with 0
position â Position of constructor
positionText - Text
wins â Wins in that year
#### * indicate the importance of the data tables
#### : foundation data table
#### : Important data table
#### : Critical data table
â Indexing, selection and filtering
Malav Parekh & Pratik Pandey
11/3/2019
# Importing required modules
import pandas as pd
import numpy as np
# Reading the 'driverStandings.csv' file and creating a copy of it.
driverStandings = pd.read_csv('driverStandings.csv')
driverStandingsCopy = driverStandings.copy()
#driverStandingsCopy
# Renaming its columns
driverStandingsCopy.rename(columns = {'driverStandingsId':'driverStandingsId', 'raceId':'raceId', 'driverId':'driverId',
'points':'Points', 'position':'Position',
'positionText':'positionText', 'wins':'Wins'}, inplace = True)
# Setting the index to 'driverStandingsId'
driverStandingsCopy.set_index('driverStandingsId',inplace=True)
driverStandingsCopy
# Reading the 'results.csv' file and creating a copy of it.
results = pd.read_csv('results.csv')
resultsCopy= results.copy()
#resultsCopy
# Renaming its columns
resultsCopy.rename(columns = {'resultId':'resultId', 'raceId':'raceId', 'driverId':'driverId',
'constructorId':'constructorId', 'number':'Number', 'grid': 'Grid',
'position':'Position', 'positionText':'positionText', 'positionOrder': 'Position order',
'points':'Points', 'laps':'Laps', 'time':'Time','milliseconds':'Time in milliseconds',
'fastestLap': 'Fastest Lap', 'rank': 'Rank', 'fastestLapTime': 'Fastest Lap Time',
'fastestLapSpeed':'Fastest Lap Speed', 'statusId':'statusId'}, inplace = True)
resultsCopy.set_index('resultId',inplace=True)
#Handling null values for multiple columns in 'results.csv'
values = {'Position': -1, 'Number': -1, 'Time': 0, 'Time in milliseconds': 0}
resultsCopy.fillna(value=values)
# Opening the circuits.csv file
# It is a TextIO Wrapper file
circuitsWrapper = open('circuits.csv', encoding = "ISO-8859-1")
circuits = pd.DataFrame(circuitsWrapper)
# Appending the valuesinto a list
list_circuits=[]
# Since the values are comma seperated and stored in a list, we split them
rowLabels_circuits=circuits[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,74):
values=circuits[0][i].split(',')
list_circuits.append(values)
# Storing the newly appended list in a DataFrame
circuitsUpdated=pd.DataFrame(list_circuits)
#circuitsUpdated
# Creating a copy of the above created DataFrame
circuitsCopy= circuitsUpdated.copy()
#circuitsCopy
# Renaming the columns
circuitsCopy.rename(columns = {0:'circuitId', 1:'CircuitRef', 2:'Name', 3:'Location',
4:'country', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)
#circuitsCopy
# Setting the index as 'circuitId'
circuitsCopy.set_index('circuitId',inplace=True)
# # Dropping redundant columns
circuitsCopy.drop(['Altitude', 9], axis=1, inplace=True)
circuitMap=circuitsCopy
circuitsCopy
circuitMap
# Reading the 'constructorResults.csv' file and creating a copy of it.
constructorResults = pd.read_csv('constructorResults.csv')
constructorResultsCopy= constructorResults.copy()
#constructorResultsCopy
# Renaming its columns
constructorResultsCopy.rename(columns = {'constructorResultsId':'constructorResultsId', 'raceId':'raceId',
'constructorId':'constructorId', 'points':'Points',
'status':'Status', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)
# Setting the index
constructorResultsCopy.set_index('constructorResultsId', inplace=True)
#constructorResultsCopy
#Handling null values
values = {'Status': 'NULL'}
constructorResultsCopy.fillna(value=values)
# Reading the 'constructors.csv' file and creating a copy of it.
constructors = pd.read_csv('constructors.csv')
constructorCopy= constructors.copy()
#constructorCopy
# Renaming the columns
constructorCopy.rename(columns = {'constructorId':'constructorId', 'constructorRef':'ConstructorRef',
'name':'Name', 'nationality':'Nationality', 'url':'URL',
'Unnamed: 5':'Unnamed: 5'}, inplace = True)
# Setting the index
constructorCopy.set_index('constructorId',inplace=True)
# Dropping redundant column
constructorCopy.drop(['Unnamed: 5'], axis=1, inplace=True)
constructorCopy
# Reading the 'constructorStandings.csv' file and creating a copy of it.
constructorStandings = pd.read_csv('constructorStandings.csv')
constructorStandingsCopy = constructorStandings.copy()
#constructorStandingsCopy
# Renaming the columns
constructorStandingsCopy.rename(columns = {'constructorStandingsId':'constructorStandingsId', 'raceId':'raceId',
'constructorId':'constructorId', 'points':'Points', 'position':'Position',
'positionText':'positionText','wins':'Wins', 'Unnamed: 7':'Unnamed: 7'}, inplace = True)
# Setting the index
constructorStandingsCopy.set_index('constructorStandingsId',inplace=True)
# Dropping redundant column
constructorStandingsCopy.drop(['Unnamed: 7'], axis=1, inplace=True)
constructorStandingsCopy
# Opeing the 'drivers.csv' TextIOWrapper file
driversWrapper = open('drivers.csv', encoding = "ISO-8859-1")
drivers = pd.DataFrame(driversWrapper)
# Creating a list
list_drivers=[]
# Splitting the comma seperated values
rowLabels_drivers=drivers[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,843):
values=drivers[0][i].split(',')
# Appending the values to the list
list_drivers.append(values)
# Creating a new DataFrame by using the list
driversUpdated=pd.DataFrame(list_drivers)
#driversUpdated
# Creating a copy of the DataFrame
driversCopy= driversUpdated.copy()
#driversCopy
# Renaming the columns
driversCopy.rename(columns = {0:'driverId', 1:'DriverRef', 2:'Number', 3:'Code',
4:'Forename', 5:'Surname', 6:'DOB', 7:'Nationality', 8:'URL'}, inplace = True)
#driversCopy
# Setting the index
driversCopy.set_index('driverId',inplace=True)
# Dropping redundant column
driversCopy.drop([9], axis=1, inplace=True)
driversCopy
# Opening the 'pitStops.csv' file and creating a copy of it after converting it to DataFrame from TextIOWrapper.
pitStopsWrapper = open('pitStops.csv', encoding = "ISO-8859-1")
pitStops = pd.DataFrame(pitStopsWrapper)
#pitStops
# Create a list
list_pitStops=[]
# Splitting the comma seperated values in the wrapper file
rowLabels_pitStops=pitStops[0][0].split(',')
# Looping through the length of the Dataframe
for i in range(1,6252):
values=pitStops[0][i].split(',')
# Appending the values to the list
list_pitStops.append(values)
# Creating a new DataFrame and adding the list to it
pitStopsUpdated=pd.DataFrame(list_pitStops)
#pitStopsUpdated
# Creating a copy of the DataFrame
pitStopsCopy= pitStopsUpdated.copy()
#pitStopsCopy
# Renaming the columns
pitStopsCopy.rename(columns = {0:'raceId', 1:'driverId', 2:'Stop', 3:'Lap',
4:'Time', 5:'Duration', 6:'Duration in milliseconds'}, inplace = True)
#pitStopsCopy
# Setting the index
pitStopsCopy.set_index('raceId',inplace=True)
pitStopsCopy
# Reading the 'races.csv' file and creating a copy of it
races = pd.read_csv('races.csv')
racesCopy = races.copy()
#racesCopy
# Renaming the columns
racesCopy.rename(columns = {'raceId':'raceId', 'year':'Year','round':'Round', 'circuitId':'circuitId', 'name':'Name',
'date':'Date','time':'Time', 'url':'URL'}, inplace = True)
# Setting the index
racesCopy.set_index('raceId',inplace=True)
race2=racesCopy
racesCopy
# Reading the 'seasons.csv' file and creating a copy of it
seasons = pd.read_csv('seasons.csv')
seasonsCopy= seasons.copy()
#seasonsCopy
# Renaming the columns
seasonsCopy.rename(columns = {'year':'Year','url':'URL'}, inplace = True)
# Setting the index
seasonsCopy.set_index('Year',inplace=True)
seasonsCopy
# Reading the 'status.csv' file and creating a copy of it
status = pd.read_csv('status.csv')
statusCopy = status.copy()
#statusCopy
# Renaming the columns
statusCopy.rename(columns = {'statusId':'statusId','status':'Status'}, inplace = True)
# Setting the index
statusCopy.set_index('statusId',inplace=True)
statusCopy
print("Number of winners: "+str(len(results[results['position'] == 1.0])))
resultsCopy=resultsCopy[resultsCopy['Position'] == 1.0]
resultsCopy=resultsCopy[resultsCopy['Grid'] != 1]
print("Number of winners who were not in grid position 1: "+str(len(resultsCopy)))
required_drivers = resultsCopy[resultsCopy['Grid'] > 3].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
required_drivers = resultsCopy[resultsCopy['Grid'] > 15].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
df=pd.DataFrame()
df = [ driversCopy[driversCopy.index == j] for i in required_drivers['driverId'] for j in driversCopy.index if int(j) == i]
#df
drivers_url=[]
for i in range(len(df)):
drivers_url.append(df[i].URL)
from bs4 import BeautifulSoup
import requests
from IPython.display import Image
path = "/Users/prati/FormulaOne/Images/"
Image(filename = path + "Watson.jpg", width=250, height=250)
Image(filename = path + "Cross.png", width=250, height=250)
Image(filename = path + "Barrichello.png", width=250, height=250)
Image(filename = path + "Raikkonen.png", width=250, height=250)
Image(filename = path + "Schumi.png", width=250, height=250)
pit_time_list1 = pd.DataFrame(pitStopsCopy.loc['881']['Duration'])
mean_list_for_a_race1 = pit_time_list1['Duration'].astype(float).mean()
mean_list_for_a_race1
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 99)
driverStandingsCopy = driverStandingsCopy.fillna(0)
driverStandingsCopy
driver_wins = {}
for index, row in driverStandingsCopy.iterrows():
if row["Position"] == 1:
if row["driverId"] in driver_wins.keys():
driver_wins[row["driverId"]] += 1
else:
driver_wins[row["driverId"]] = 1
#driver_wins
driverStandingsDf= pd.DataFrame(driverStandingsCopy)
driversDf= pd.DataFrame(driversCopy)
driverStandingsDf=driverStandingsDf.sort_values('driverId')
driversCopy.head()
driversCopy.reset_index(inplace=True)
print(driversCopy.dtypes, driverStandingsCopy.dtypes)
driversCopy['driverId']=driversCopy['driverId'].astype('int')
driverStandings
data=pd.merge(driverStandingsCopy, driversCopy, on='driverId', how='left')
data
data.groupby(['DriverRef']).count().sort_values(by='Wins', ascending=False, inplace=True)
data
data = data[data["Position"]==1]
data["Surname"].value_counts()
import matplotlib.pyplot as plt
#import plotly.plotly as py
import plotly.graph_objs as go
import chart_studio
import chart_studio.plotly as py
colors = ['grey'] * 55
colors[0] = 'crimson'
chart_studio.tools.set_credentials_file(username='pratikp', api_key='AHxozWNTaluAU7sx4YW7')
data1=[go.Bar(x=data["Surname"].value_counts().index, y=data["Surname"].value_counts().values, marker_color=colors)]
py.iplot(data1,filename='Wins by Drivers')
unique_raceid_list = list(pitStopsCopy.index.unique())
pit_stops_average_df = pd.DataFrame(columns=['Year', 'Average_pit_time'])
for year in range(2011,2018):
# print(year)
race_list_for_year = racesCopy[racesCopy['Year']==int(year)].index.unique()
mean_list_for_a_race = []
for race in race_list_for_year:
# print(race)
try:
pit_time_list = pd.DataFrame(pitStopsCopy.loc[str(race)]['Duration'])
mean_list_for_a_race += [pit_time_list['Duration'].astype(float).mean()]
except:
print(race)
mean_list_for_a_race
mean_dfrow_for_a_specific_year = pd.DataFrame(mean_list_for_a_race)
# float(mean_dfrow_for_a_specific_year.mean())
pit_stops_average_df = pit_stops_average_df.append({'Year':int(year),'Average_pit_time': float(mean_dfrow_for_a_specific_year.mean())}, ignore_index=True)
pit_stops_average_df
pit_time_list = pd.DataFrame(pitStopsCopy.loc['969']['Duration'])
# mean = pit_time_list.mean()
meann = pit_time_list['Duration'].astype(float).mean()
#meann
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.lineplot(x=pit_stops_average_df.Year, y=pit_stops_average_df.Average_pit_time);
plt.ylabel('Time')
plt.title('Year')
# plt.xticks(rotation='vertical');
racesCopy['circuitId']=racesCopy['circuitId'].astype('int')
racesCopy=racesCopy.reset_index()
circuitsCopy=circuitsCopy.reset_index()
circuitsCopy['circuitId']=circuitsCopy['circuitId'].astype('int')
#fig.update_layout(barmode='group', xaxis_tickangle=-45)
#fig.show()
data2=pd.merge(circuitsCopy, racesCopy, on='circuitId', how='left')
data2[['circuitId','Name_x','Location','country','raceId']]
constructorStandingsCopy=constructorStandingsCopy.reset_index()
#constructorStandingsCopy
print(type(constructorStandingsCopy))
data3=pd.merge(data2, constructorStandingsCopy, on='raceId', how='left')
pd.set_option("display.max_rows", 12000)
data3.head()
data3[data3['Position']==1.0].head()
data8=data3.groupby(['circuitId','constructorId']).count()
#data8
data7=data3.groupby(['circuitId','constructorId']).count().sort_values(by='Wins',ascending=False)
#data7=data7.reset_index()
data9=data7
#data9
data10=data9.reset_index()
data11=pd.DataFrame(data10.groupby(by='circuitId')['Wins'].max())
#data10.groupby(['circuitId','constructorId']).sort_values(by='Wins',ascending=True).drop_duplicates(keep= 'last')
#data11
#data11=data10[['circuitId','constructorId','Wins']]
#data10
data12=pd.merge(data10, data11, on=['circuitId','Wins'], how='inner')
data13=data12[['circuitId','constructorId','Wins']]
data13.sort_values(by=['circuitId','constructorId'],ascending=True).head()
#data13['Wins'] = data[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
data14=pd.merge(data13, circuitsCopy, on='circuitId', how='left')
data14[['circuitId','constructorId','Wins','Latitude','Longitude']].head()
import plotly.express as px
fig1 = go.Figure(go.Densitymapbox(lat=data14.Latitude, lon=data14.Longitude, z=data14.Wins, radius=25))
fig1.update_layout(mapbox_style="open-street-map", mapbox_center_lon=180)
fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig1.show()
res=results
race=racesCopy
values = {'position': -1, 'number': -1, 'time': 0, 'milliseconds': 0}
res=res.fillna(value=values)
data4=pd.merge(res, race, on='raceId', how='left')
data4=data4[data4['position'] == -1.0]
data5=data4.groupby(['circuitId']).count().sort_values(by='raceId',ascending=False)
data5=data5.reset_index()
data5[['circuitId','number']].head()
data6=pd.merge(data5, circuitsCopy, on='circuitId', how='left')
data6[['circuitId','number','Latitude','Longitude']].head()
#import plotly.graph_objects as go
fig = go.Figure(go.Densitymapbox(lat=data6.Latitude, lon=data6.Longitude, z=data6.number, radius=35))
fig.update_layout(mapbox_style="stamen-terrain", mapbox_center_lon=180)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
circuitMap['Raceloc'] = 0
circuit_id_list = list(race2[race2['Year'] == 2018]['circuitId'])
circuit_id_list
circuit_id_list_notlive_i = list(race2[race2['Year'] != 2018]['circuitId'])
circuit_id_list_notlive = [x for x in circuit_id_list_notlive_i if x not in circuit_id_list]
# len(circuit_id_list_notlive)
# len(circuit_id_list)
# circuitsCopy[~circuitsCopy['circuitId'].isin(circuit_id_list)]['Raceloc'] = 1
circuit_id_list
# circuit_id_list
for rowId in circuit_id_list:
circuitMap.loc[str(rowId),'Raceloc'] = 1
circuitMap.head()
import seaborn as sns
import plotly.graph_objects as go
mapbox_access_token = 'pk.eyJ1Ijoic3VtdWtoc2hhcm1hIiwiYSI6ImNrMm92bjJydzE3c3UzcHQ1Nzdjb3JwNHMifQ.x_ybvMGOCORXoXDYqnJCCg'
class_map_data1 = go.Scattermapbox(
lon = circuitMap[circuitMap['Raceloc']!=1]['Longitude'],
lat = circuitMap[circuitMap['Raceloc']!=1]['Latitude'],
text = circuitMap[circuitMap['Raceloc']!=1]['Name'],
hoverinfo='text',
mode = 'markers',
marker = dict(
color = 'red',
symbol = 'circle',
opacity = .5
),
name = "Not live circuits"
)
class_map_data2 = go.Scattermapbox(
lon = circuitMap[circuitMap['Raceloc']==1]['Longitude'],
lat = circuitMap[circuitMap['Raceloc']==1]['Latitude'],
text = circuitMap[circuitMap['Raceloc']==1]['Name'],
hoverinfo='text',
mode = 'markers',
marker = dict(
color = 'green',
symbol = 'circle',
opacity = 1
),
name = "Live circuits"
)
class_map_layout = go.Layout(
title = 'Circuit Locations',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
zoom=1
)
)
class_map = go.Figure(data=[class_map_data1,class_map_data2], layout=class_map_layout)
class_map.show()